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ABSTRACT 

MapReduce is becoming the de facto framework for storing and 
processing massive data, due to its excellent scalability, reliability, 
and elasticity. In many MapReduce applications, obtaining a com- 
pact accurate summary of data is essential. Among various data 
summarization tools, histograms have proven to be particularly im- 
portant and useful for summarizing data, and the wavelet histogram 
is one of the most widely used histograms. In this paper, we in- 
vestigate the problem of building wavelet histograms efficiently on 
large datasets in MapReduce. We measure the efficiency of the al- 
gorithms by both end-to-end running time and communication cost. 
We demonstrate straightforward adaptations of existing exact and 
approximate methods for building wavelet histograms to MapRe- 
duce clusters are highly inefficient. To that end, we design new al- 
gorithms for computing exact and approximate wavelet histograms 
and discuss their implementation in MapReduce. We illustrate our 
techniques in Hadoop, and compare to baseline solutions with ex- 
tensive experiments performed in a heterogeneous Hadoop cluster 
of 16 nodes, using large real and synthetic datasets, up to hundreds 
of gigabytes. The results suggest significant (often orders of magni- 
tude) performance improvement achieved by our new algorithms. 

1. INTRODUCTION 

MapReduce is becoming the de facto framework for storing and 
processing massive data, due to its excellent scalability, reliabil- 
ity, and elasticity [15]. Efficient data processing in MapReduce 
has received lots of attention since its debut. Development for 
its open-source realization, Hadoop [22], has been particularly ac- 
tive, e.g., HadoopDB [1], Hadoop-l-l- [16], MapReduce Online [1 1], 
Pig [19,29], Hive [36] and others. Datasets stored and processed in 
Hadoop or any MapReduce platform are usually enormous, rang- 
ing from tens of gigabytes to terabytes [15, 31]. Hence, in many 
MapReduce applications, obtaining a compact accurate summary 
of a dataset is important. Such a summary captures essential statis- 
tical properties of the underlying data distribution, and offers quick 
insight on the gigantic dataset, provided we can compute it effi- 
ciently. For example, this allows other MapReduce jobs over the 
same dataset to better partition the dataset utilizing its histogram 
which leads to better load-balancing in the MapReduce cluster [15]. 
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In traditional database systems and many modern data manage- 
ment applications, an important useful summary on large datasets 
is the histogram [32]. Suppose the keys of a dataset are drawn from 
finite domain [u] — {!,■■■ ,u}. Broadly speaking, a histogram on 
the dataset is any compact, possibly lossy, representation of its fre- 
quency vector V = (v(l), . . . , v(it)), where v(a;) is the number of 
occurrences of key x in the dataset. There are many different his- 
tograms depending on the form this compact representation takes. 
One popular choice is the wavelet histogram [26]. Treating v as 
a signal, the wavelet histogram consists of the top-fc wavelet co- 
efficients of V in terms of their magnitudes (absolute values), for 
a parameter k. As most real-world distributions have few large 
wavelet coefficients with others close to zero, retaining only the k 
largest yields a fairly accurate representation of v. Due to its sim- 
plicity, good accuracy, and a variety of applications in data analysis 
and query optimization, wavelet histograms have been extensively 
studied. Efficient algorithms are well known for building a wavelet 
histogram on offline data [26, 27] and for dynamically maintaining 
it in an online or streaming [13,20,27] fashion. 

In this work, we study how to efficiently build wavelet histograms 
for large datasets in MapReduce. We utilize Hadoop to demonstrate 
our ideas, which should extend to any other MapReduce imple- 
mentation. We measure the efficiency of all algorithms in terms of 
end-to-end running time (affected by the computation and 10 costs) 
and intra-cluster communication (since network bandwidth is also 
scarce in large data centers running MapReduce [15], whose usage 
needs to be optimized). Note that communication cost might not be 
significant when running only one particular MapReduce job (this 
is often the case); however, in a busy data center/cluster where nu- 
merous jobs might be running simultaneously, the aggregated effect 
from the total communications of these jobs is still critical. 

We show straightforward adaptations of both exact and approxi- 
mate wavelet histogram construction methods from traditional data 
management systems and data mining fields to MapReduce clusters 
are highly inefficient, mainly since data is stored in a distributed file 
system, e.g., the Hadoop Distributed File System (HDFS). 

Contributions. We propose novel exact and approximation algo- 
rithms tailored to MapReduce clusters, in particular Hadoop, which 
outperform straightforward adaptations of existing methods by sev- 
eral orders of magnitude in performance. Specifically, we: 

• present a straightforward adaptation of the exact method in 
Hadoop, and a new exact method that can be efficiently in- 
stantiated in MapReduce in Section 3; 

• show how to apply existing, sketch-based approximation al- 
gorithms in Hadoop, and discuss their shortcomings. We de- 
sign a novel random sampling scheme to compute approxi- 
mate wavelet histograms efficiently in Hadoop in Section 4; 
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• conduct extensive experiments on large (up to 400GB) data- 
sets in a heterogeneous Hadoop cluster with 16 nodes in Sec- 
tion 5. The experimental results demonstrate convincing re- 
sults that both our exact and approximation methods have 
outperformed their counterparts by several orders of magni- 
tude. 

In addition, we introduce necessary background on MapReduce, 
Hadoop, and wavelet histograms in Section 2, survey related work 
in Section 6, and conclude in Section 7. 

2. PRELIMINARIES 
2.1 Wavelet Basics 

Suppose each record in the dataset has a key drawn from domain 
[w] = {1, • ■ • ,u}, and we want to build a wavelet histogram on the 
keys. Define the frequency vector as v = (v(l), . . . , v(ii)) where 
v(a::) is the number of occurrences of key x in the dataset. The idea 
of building a histogram using wavelets is to consider v as a sig- 
nal and apply a wavelet transformation. For most applications, one 
usually adopts the simplest Haar wavelet basis [13, 18,20,26,27], 
which is defined as follows. We first average values pairwise to 
obtain the average coefficients, i.e. [(v(2) + v(l))/2, (v(4) + 
v(3))/2, . . . , (v(?i) + v(u - l))/2]. We also retain the aver- 
age difference of the pairwise values, i.e. [(v(2) — v(l))/2, . . . , 
(v(ii) — v(ti — l))/2], which are called the detail coefficients. 
Clearly, given these vectors one can reconstruct the original signal 
V exactly. We recursively apply this pairwise averaging and dif- 
ferencing process on the average coefficients vector until we reach 
the overall average for v. The Haar wavelet coefficients of v are 
given by the overall average, followed by the detail coefficients in 
a binary tree, as shown by example in Figure 1, where the leaf level 
of the tree (level £ = log u) is the original signal. To preserve the 
energy of the signal (v's L2 norm), one must multiply coefficients 
in level ^ by a scaling factor \fuf^. 

If 1 (g^ total average 
,^j-j^caling factor: \f^\^^ 

(2j^r^^^ £ = 1 

S S S S S ^ E£ = 3 
Figure 1: Wavelet coefficients. 

This transformation is lossless as we can reconstruct v exactly 
from all u wavelet coefficients. However, the main reason wavelets 
are popular and powerful in signal processing is, for most real- 
world signals v, most of its wavelet coefficients are near zero. 
Thus if for a parameter k we keep only the k wavelet coefficients 
of largest magnitude while assuming others are zero, we can still 
reconstruct the original signal reasonably well. Since energy is 
preserved under wavelet transform, i.e., ||v||2 = X]"=i '^(0^ ~ 
X]"=i keeping the k wavelet coefficients of largest magnitude 
minimizes energy loss for all fc-term wavelet representations of v. 
The best A:-term wavelet representation can be computed efficiently 
in a centralized setting [26]: Assuming entries in frequency vec- 
tor V are given in order, one can compute all wavelet coefficients 
bottom-up in 0(u) time. Then, using a priority queue of size fc, we 
can find the k coefficients of largest magnitude in one pass over all 
u coefficients, taking time 0(u log fc). 
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Figure 2: Coefficients by wavelet basis vectors 



Another method to compute wavelet coefficients, especially in 
streaming settings, is to use wavelet basis vectors. The first wavelet 
basis vector is -!/>! = [1, . . . , l]/y'M. To define the other u — 1 
basis vectors, we first introduce, for j = 1, . . . , log it and k = 
0, . . . , 2^ - 1, the vector <?!>j,fe(0 = 1 for fc(w/2-') + 1 < « < 
k{u/2-') + u/2-' , and elsewhere. For j = 0, . . . , \ogu — 1 and 

= 0, . . . , 2-' — 1, we define the i-th wavelet basis vector for 
j = 2^ + fc + 1 as = {~(i>j+i,2k + <^j+i,2fe+i)/ W2^' where 
is a scaling factor. The wavelet coefficients are the dot 
products of V with these wavelet basis vectors, i.e., Wi — (v, tpi), 
foi i — 1, . . . , It; see Figure 2 for an illustration. 

Wavelets provide a compact approximation of a data distribution. 
It serves a variety of data analysis tasks such as range selectivity es- 
timation [26], approximating queries [9] and many other data min- 
ing applications [3,21,33]. As we are concerned with constructing 
a best fc-term wavelet representation, we will not talk about its use 
which has already been well studied [26]. 

Wavelet histograms also extend to multi-dimensional signals or 
datasets. Consider the two-dimensional case where keys are drawn 
from two-dimensional domain [u]'^, defining a two-dimensional fre- 
quency aiTay v — {'v(x, y)), 1 < x,y < u. A I'D wavelet trans- 
form first applies a standard ID wavelet transform to each row of v. 
Then, using the ID wavelet coefficients as inputs, we apply a sec- 
ond round of ID wavelet transforms to each column of the array. 
This process can be similarly extended to d dimensions. 

2.2 Hadoop Basics 

For this work we assume Hadoop's default file system HDFS. 
A cluster using HDFS consists of multiple DataNodes, for storing 
file system data, and a single master node designated as the Na- 
meNode which oversees all file operations and maintains all file 
meta-data. A file in HDFS is split into data chunks, 64MB in size 
by default, which are allocated to DataNodes by the NameNode. 
Chunks are typically replicated to multiple DataNodes, based on 
the file replication ratio, to increase data availability and fault toler- 
ance. In this work and many other studies where fault tolerance is 
not the main subject of interest, the replication ratio is set to 1 and 
machine failure is not considered. The MapReduce core consists of 
one master JobTracker task and many TaskTracker tasks. Typical 
configurations run the lobTracker and NameNode on the same ma- 
chine, called the master, and run TaskTracker and DataNode tasks 
on other machines, called slaves. 

Typical MapReduce jobs consist of three phases: Map, Sort-and- 
Shuffle, and Reduce. The user may specify rn, the desired number 
of Mapper tasks, and r, the number of Reducer tasks before starting 
the job. Next we look at the three phases in detail. 

Map phase. In the Map phase, the m Mappers run in parallel 
on different TaskTrackers over different logical portions of an in- 
put file, called splits. Splits typically, but not always, correspond to 
physical data chunks. Hadoop allows users to specify the InputFor- 
mat for a file, which determines how splits are created and defines 
a RecordReader for reading data from a split. 

After splits have been formed, the JobTracker assigns each avail- 
able Mapper a split to process. By default, the scheduler attempts 
to schedule Data-Local Mappers by assigning a Mapper a locally 
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stored split. There are also cases which call for Non-Data-Local 
Mappers, i.e., when a node is idle and has no local split to process. 
Then, a MapRunner is started which obtains a RecordReader and 
invokes the Map function for each record in the split. A Mapper 
then maps input key- value pairs {ki,vi) from its split to intermedi- 
ate key- value pairs (^2, «2). As a Mapper proceeds, it maintains an 
in-memory buffer of the (^2, W2). When the buffer fills to thresh- 
old, pairs are partitioned, sorted, and optionally processed by the 
Combine function, which outputs locally aggregated {k2,V2) pairs 
(aggregation on 112 's with the same key fc2). Pairs are then spilled 
to their corresponding logical partitions on the local disk. The par- 
titions are defined by a Partition function, typically a hash function 
like hash{k2) mod r, which determines the Reducer task which 
will process a particular ^2 later. When the Mapper ends, all emit- 
ted (^2, W2) have been partitioned, sorted (w.r.t. fe2), and optionally 
combined. One can also define a Close interface which executes at 
the end of the Mapper. 

Shuffle-and-Sort phase. In the Shuffle-and-Sort Phase, each Re- 
ducer copies all (^2, W2) it is responsible for (as designated by the 
Partition function) from all DataNodes. It then sorts all received 
(/c2, ^2) by k2 so all occurrences of key ^2 are grouped together. 
An external sort is needed if the (^2, "2) do not fit in memory. 

Reduce phase. After all (^2, W2) are collected and sorted, a Re- 
ducer iterates over all its {k2,V2). For each distinct key fe, the 
Reducer passes all corresponding V2 values to the Reduce function. 
Then the Reduce function produces a final key-value pair (ks, v^) 
for every intermediate key k2. As in the Map phase, one can imple- 
ment a Close interface which is executed at the end of the Reducer. 

3. EXACT COMPUTATION 

Baseline solutions Let n be the total number of records in the entire 
dataset, where each record has a key drawn from key domain [u] . 
Note either n 3> it or n <C m is possible. Recall in Hadoop the 
n records are partitioned into m splits, processed by m Mappers, 
possibly on different machines, which emit intermediate key-value 
pairs for processing by Reducers. Thus, one baseline solution to 
compute the wavelet representation is to compute, for each split 
J = 1, . . . , m, its local frequency vector Vj, and emit a [x, Vj(a;)) 
pair for each key x in the split. The Reducer then can aggregate 
the local frequencies, producing the overall frequency vector v = 
J Vj where Vj [x) = if key x does not appear in the jth 
split. Finally, we compute the best A: -term wavelet representation 
of V using the centralized algorithm (e.g., [26]). 

We observe that each wavelet coefficient Wi — (v, tpi) can be 
written as 

/ m \ m 

\j=i / i=i 

i.e., Wi is the summation of the corresponding local wavelet coef- 
ficients of frequency vectors for the m splits. Then, an alternate 
approach to compute the exact wavelet coefficients is to compute, 
for each split j — 1 , . . . , m its local frequency vector Vj . The local 
coefficients Wij = (vj , ijji) are computed for each split's local fre- 
quency vector -Vj and a {i, Wij) pair is emitted for each non-zero 
Wij. The Reducer can then determine the exact Wi as X^jLi "^iJ 
where Wi,j = if the Reducer does not receive a Wi,j from the 
jth split. After computing all complete Wi the Reducer selects the 
best fc-term wavelet representation, i.e. by selecting the top-fc co- 
efficients of largest absolute value. 

A big drawback of the baseline solutions is they generate too 
many intermediate key-value pairs, 0{mu) of them to be precise. 



This consumes too much network bandwidth, which is a scarce re- 
source in large data clusters shared by many MapReduce jobs [15]. 

A new algorithm. Since Wi is the summation of the corresponding 
local wavelet coefficients of frequency vectors for the m splits, if 
we first compute the local coefficients Wi,j = {■Vj,tl)i), the prob- 
lem is essentially a distributed top-fc problem. A major difference 
is in the standard distributed top-fc problem, all local "scores" are 
non-negative, while in our case, wavelet coefficients can be positive 
and negative, and we want to find the top-fc aggregated coefficients 
of largest absolute value (magnitude). Negative scores and finding 
largest absolute values are a problem for existing top-fc algorithms 
such as TPUT and others [7, 17,28], as they use a "partial sum" to 
prune items which cannot be in the top-fc. That is, if we have seen 
(at the coordinator) t local scores for an item out of m total local 
scores, we compute a partial sum for it assuming its other m — t 
scores are zero. When we see fc such partial sums we use the fcth 
largest partial sum as a threshold, denoted r, to prune other items: 
If an item's local score is always below r/m at all sites, it can be 
pruned as it cannot get a total score larger than r to get in the top-fc. 
If there are negative scores and when the goal is to find largest ab- 
solute values, we cannot compute such a threshold as unseen scores 
may be very negative. 

We next present a distributed algorithm which handles positive 
and negative scores (coefficients) and returns the top-fc aggregated 
scores of largest magnitude. The algorithm is based on algorithm 
TPUT [7], and can be seen as interleaving two instances of TPUT. 
As TPUT, Our algorithm requires three rounds. For an item x, r{x) 
denotes its aggregated score and rj{x) is its score at node j. 

Round 1: Each node first emits the fc highest and fc lowest (i.e., 
most negative) scored items. For each item x seen at the coordina- 
tor, we compute a lower bound t{x) on its total score's magnitude 
|r(a;)| (i.e., \r{x)\ > t{x)), as follows. We first compute an up- 
per bound r+ (x) and a lower bound r ^ (x) on its total score r{x) 
(i.e., T~ (x) < r{x) < t^{x)): If a node sends out the score of 
X, we add its exact score. Otherwise, for t^{x), we add the fc-th 
highest score this node sends out and for {x) we add the fc-th 
lowest score. Then we set t{x) = if t^{x) and t~ (x) have dif- 
ferent signs and t{x) = min{|r^(a::)|, otherwise. Doing 
so ensures t~ (x) < r{x) < t'^{x) and |r(a::)| > t{x). 

Now, we pick the fc-th largest t(x), denoted as Ti. This is a 
threshold for the magnitude of the top-fc items. 

Round 2: A node j next emits all local items x having [^^(a;)! > 
Ti/m. This ensures an item in the true top-fc in magnitude must be 
sent by at least one node after this round, because if an item is not 
sent, its aggregated score's magnitude can be no higher than Ti. 

Now, with more scores available from each node, we refine the 
upper and lower bounds t'^{x), t~ (x), hence t{x), as previously 
for each item x £ R, where R is the set of items ever received. If 
a node did not send the score for some x, we can now use Ti/m 
(resp. — Ti / m) for computing r"*" (x) (resp. t~ (x)). This produces 
a new better threshold, T2 (calculated in the same way as comput- 
ing Ti with improved r(x)'s), on the top-fc items' magnitude. 

Next, we further prune items from R. For any x G Rv/e. com- 
pute its new threshold t'{x) = max{|T^(a;)|, |r~(a;)|} based on 
refined upper and lower bounds t'^{x), t~ (x). We delete item x 
from R if t'{x) < T2. The final top-fc items must be in the set R. 

Round 3: Finally, we ask each node for the scores of all items in 
R. Then we compute the aggregated scores exactly for these items, 
from which we pick the fc items of largest magnitude. 

A simple optimization is, in Round 2, to not send an item's local 
score if it is in the local top-fc/bottom-fc sets, even if |r'j(a;)| > 
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Ti/m\ these scores were sent in Round 1. Also in Round 3, a 
node can send an item's local score only if it was not sent to the 
coordinator in previous rounds (using simple local bookkeeping). 

Multi-dimensional wavelets. It is straightforward to extend our al- 
gorithms to build multi-dimensional wavelet histograms. Consider 
the two-dimensional case. Recall in this case frequency vector v 
is a 2D array. A 2D wavelet transform applies two rounds of ID 
wavelet transforms on the rows and then the columns of v. Since 
each wavelet transform is a linear transformation, the resulting 2D 
wavelet coefficients are still linear transformations of v. So if we 
apply a 2D wavelet transform to each split, any 2D wavelet coef- 
ficient is still a summation of corresponding 2D coefficients of all 
splits. Thus, we can still run the modified TPUT algorithm to find 
the top-fc coefficients of largest magnitude as before. 

System issues. At a high-level, for a dataset in HDFS with m splits, 
we assign one Mapper task per split and each Mapper acts as a dis- 
tributed node. We use one Reducer as the coordinator. We imple- 
ment our three-round algorithm in three rounds of MapReduce in 
Hadoop. To be consistent across rounds, we identify each split with 
its unique offset in the original input file. 

Two technical issues must be dealt with when implementing the 
algorithm in Hadoop. First, the algorithm is designed assuming 
the coordinator and distributed nodes are capable of bi-directional 
communication. However, in MapReduce data normally flows in 
one direction, from Mappers to Reducers. In order to have two- 
way communication we utilize two Hadoop features: the Job Con- 
figuration and Distributed Cache. The Job Configuration is a small 
piece of information communicated to every Mapper and Reducer 
task during task initialization. It contains some global configuration 
variables for Mappers and Reducers. The Job Configuration is good 
for communicating a small amount of information. If large amounts 
of data must be communicated, we use Hadoop's Distributed Cache 
feature. A file can be submitted to the master for placement into the 
Distributed Cache. Then, Distributed Cache content is replicated to 
all slaves during the MapReduce job initialization. 

Second, the distributed nodes and the coordinator in the algo- 
rithm need to keep persistent state across three rounds. To do so, at 
the end of a Mapper task handling an input split, via its Close inter- 
face, we write all necessary state information to an HDFS file with 
a file name identifiable by the split's id. When this split is assigned 
to a Mapper in a subsequent round, the Mapper can then restore the 
state information from the file. Note Hadoop always tries to write 
an HDFS file locally if possible, i.e., state information is usually 
saved on the same machine holding the split, so saving state infor- 
mation in an HDFS file incurs almost no extra communication cost. 
For the Reducer which acts as the coordinator, since there is no split 
associated to it, we choose to customize the JobTracker scheduler 
so the Reducer is always executed on a designated machine. Thus, 
the coordinator's state information is saved locally on this machine. 

We detail how we address these challenges in Appendix A. 

4. APPROXIMATE COMPUTATION 

We observe the exact computation of the best fc-term wavelet 
representation in Hadoop is expensive. Although our improved al- 
gorithm avoids emitting all local frequency vectors, it could still 
be expensive due to the following: (1) The (modified) TPUT al- 
gorithm could still send out a lot of communication, though better 
than sending all local frequency vectors; (2) it needs 3 rounds of 
MapReduce, which incurs a lot of overhead; and (3) most impor- 
tantly, every split needs to be scanned to compute local frequency 
vector Vj and compute local wavelet coefficients w'i.j. This mo- 
tivates us to explore approximation algorithms which compute a 



A:-term wavelet representation which may not be the best one, but 
still approximates the underlying data distribution reasonably well. 

There are many design choices for approximate computation of 
wavelets. Here are some natural attempts: (i) We can replace TPUT 
with an approximate top-fc algorithm [28, 30], after appropriate 
modification to handle negative scores. This resolves issue (1) but 
not (2) and (3). (ii) We can approximate local wavelet coefficients 
of each split using a sketch as in [13, 20], and then send out and 
combine the sketches, due to the property that these sketches are 
linearly combinable. This resolves issues (1) and (2), but not (3), 
as computing a sketch still needs to scan the data once, (iii) Lastly, 
a generic approach is random sampling, that is, we take a random 
sample of the keys and construct the wavelets on the sample, as 
the sample approximates the underlying data distribution well for 
a sufficiently large sample size. Then a wavelet representation can 
be constructed on the frequency vector of the sample. 

Among the possibilities, only (iii) resolves all three issues si- 
multaneously. It requires only one round, clearing issue (2). It also 
avoids reading the entire data set, clearing issue (3). However, it 
may result in a lot of communication, as it is well known to ap- 
proximate each (global) frequency v(x) with a standard deviation 
of en (recall n is the number of records in the entire dataset), a sam- 
ple of size Q{l/e^) is required [37]. More precisely, for a sample 
probability p = l/(e^7i) (a sample of expected size pn — 1/e^), 
one can show v(a;) = s{x)/p is an unbiased estimator of v(a;) 
with standard deviation 0(en) for any x, where s is the frequency 
vector of the sample. After that, we construct a wavelet represen- 
tation on the estimated frequency vector v. As n is the size of the 
entire data set, which is usually extremely large (for MapReduce 
clusters), e needs to be fairly small for v to approximate v well, 
usually on the order of 10^'* to 10^*. The total communication 
cost of this basic sampling method is 0(l/e^), even with one-byte 
keys, this corresponds to 100MB to 1TB of data being emitted to 
the network! 

A straightforward improvement is to summarize the sampled keys 
of a split before emitting them, which is actually used as a simple 
optimization for executing any MapReduce job [15]. We aggregate 
the keys with the Combine function, that is, if the split is emitting c 
pairs (a;, 1) for the same key, they are aggregated as one pair {x,c). 
This optimization indeed reduces communication cost, but its ef- 
fectiveness highly depends on the data distribution, in the worst 
case it may not reduce the communication at all. 

A slightly better idea is to ignore those sampled keys with low 
frequencies in a split, which we denote as the improved sampling 
algorithm. More precisely, we only send out a sampled key x 
and its sampled count Bj{x) if Sj{x) > etj, where tj is the to- 
tal number of sampled records in split j. Thus the overall error 
in the total count of a sampled key x from all splits is at most 
J^J^j^ etj ~ epn = 1/e, which translates into an {l/e)/p = 
en error in the estimated frequency v(2:). Thus it adds another 
en to the standard deviation, which is still 0{en). Note that the 
total number of key-value pairs sent out by one split is at most 
tj/(etj) — 1/e. Hence, the total communication of this approach 
is at most 0{m/e), which improves upon sending all the samples 
since usually we have m <^ 1/e. However, an undesired conse- 
quence is that the estimator v(3;) will not be unbiased any more: 
E[v(a::)] could be en away from v(a:), since this method ignores 
all the small sample counts Sj (x) < etj . 

Below we detail a new, two-level sampling idea, which produces 
an unbiased estimator v(x-) for v(a;) with standard deviation 0{en) 
as in the basic random sampling algorithm, while improving com- 
munication cost to 0{^/m/e). The idea is to obtain an unbiased 
estimator s^(a;) of s{x), instead of sending all Sj{xys to compute 
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Basic Sampling, where p = 
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Figure 3: Two-level sampling at mapper. 

s(a;) exactly. We then use?(a;) to produce v(x-). We perform an- 
other level of sampling on the local frequency vector Sj of sampled 
keys for each split j. Specifically, we sample each key x in the 
sample with probability minjey'm • s_, (x) , 1}. More precisely, for 
any x with 8^(2;) > l/{e^/rn), we emit the pair {x,Sj{x)); for 
any x with < Sj{x) < l/{e^/rn), we sample it with a probabil- 
ity proportional to Sj{x), i.e., e^m • 8^(2:), and emit the pair (x, 
NULL) if it is sampled (for an example please see Figure 3). Note 
that in two-level sampling we do not throw away sampled items 
with small frequencies completely, as what is done in the Improved 
sampling method. Rather, these items are still given a chance to 
survive in the second-level sample, by sampling them proportional 
to their frequencies relative to the threshold l/e^/m (which is es- 
tablished from our analysis below). 

Next, we show how to construct from the emitted pairs from all 
splits, an unbiased estimator of s(a;) for any key x G [it] with 
standard deviation at most 1/e. As s(a;) = X^jLi we add 

up all sample count {x,Sj{x)) pairs received for x. They do not 
introduce any error, and we denote this partial sum as p{x). If a 
split has Sj{x) < l/{e^/rn), it will not emit Sj{x), but simply 
emit (x, NULL) if it is sampled. Suppose we receive M such pairs 
for X. Then our estimator is 



3(2;) = p{x) + M/{e^) 



(1) 



(for an example of how we compute s{x) at the reducer please see 
Figure 4). 

Theorem 1 s{x) is an unbiased estimator ofs{x) with standard 
deviation at most 1/e. 

Proof. Without loss of generality, assume in the first m' splits 
Sj{x) < l/(e^). Write M as M = where Xj = 1 if 

X is sampled in split j and otherwise. Each Xj is an independent 
Bernoulli trial, so 

EfJfj] = e\pm ■ Sj (x), and 

Var[Xj] = e\/^-Sj{x){\ — E\fm-^j[x)) < e\/rn ■ Sj{x). (2) 
Thus we have 
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lfSj(x)>l/{E^),emit (x.Sj(x)). 
Else emit (x, null) with probability 
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emitted pairs from S; 



1 initialize p(x) = 0. M = 0. 

- If (x. Sy(x)) received, /j(x) = //{x) ^ sy(x}. 

- Else if (x, null) received, M = M + 1. 
.s(x) = p(x) + M/Ev^. 



Figure 4: Two-level sampling at reducer. 

From s{x), we can estimate v(a::) as v(a;) = s{x)/p (recall that 
p — l/(e^n) is the sampling probability of the first level random 
sample in each split). It will be an unbiased estimator of v(a;) with 
standard deviation (l/e)/p = en. 

Corollary 1 'vix) is an unbiased estimator ofv{x) with standard 
deviation at most en. 

Corollary 1 gives a bound on the error of the estimated frequen- 
cies. Below we also analyze the error in the computed wavelet 
coefficients. Consider the coefficient to^ — (v, i/ji), where i/^i = 
(— 0j+i,2fe + 0j+i,2fc+i)/\/u/2^ is the corresponding wavelet ba- 
sis vector (see discussion in Section 2.1). From the estimated fre- 
quency vector V, we estimate Wi as Wi — (v, ^i). Since v(a;) for 
every x is unbiased, Wi is also an unbiased estimator of Wi. Recall 
thatViCx) = -1,+lforx = 2fcu/2^+^ + l, . . . , (2fc + 2)u/2^+\ 
so the variance of Qi is 



VarfwJ = — Var[v(a;)l 

(2fe+2)ii/2^ + '- 

Var[s(a::)]/p' 

i: = 2fcii/2j + l+l 
(2fe+2)ii/2^ + l 



2' 
u 

V 

u 



— V Var[Afl/(e^mp2) 

2! = 2fcll/2j + l+l 



2^n 



(2fe+2)ii/2J + l 



< — V e^s{x) (by (4)) 



eVn 



{2fc+2)ti/2J + l 



(5) 



a; = 2feu/2J + l + l 



Note that X]i^2fcu^/2J>i+i J*^^' number of keys 

covered by the wavelet basis vector This discussion leads to the 
next result: 



E[Af] = Ye^-Sj{x) = eyMis{x) - p{x)), (3) 

i.e., E[s(a;)] = s(a;) combining (1) and (3). 
Next, from (2), we have 



Theorem 2 The two-level sampling method provides an unbiased 
estimator Wifor any wavelet coefficient Wi, and the variance ofwi 
is bounded by (5). 

Finally, it remains to bound its communication cost. 



Var[M] = E Var[Xj] = e^m ■ ^ Sj{x). (4) 

Since each 8^(2:) < l/(£y'm), Var[M] is at most m' . Thus, 
the variance of s^(a;) is Va.r[M / (e^/rn)] = Var[M]/(e^m). So 
Var[?(a::)] < m' /{e^m) < l/e^, namely, the standard deviation 
is at most 1/e. □ 



Theorem 3 The expected total communication cost of our two- 
level sampling algorithm is 0{y/m/e). 

Proof. The expected total sample size of first-level sampling is 
pn = 1/e^. Thus, there are at most (1/e^)/ {l/{e^/rn)) = yprnje 
keys with Sj (a;) > 1/ (e^/rn) across all splits. These keys must be 
emitted for second level sampling. For any key x in any split j with 
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Sj{x) < \l(e^fm), we emit it with probability e^/m • Sj{x), so 
the expected total number of sampled keys for this category is 

e\frn ■ Sj (x) < ey''^ ■ 1 /^^ ~ V^/s- 

i X 

So the total number of emitted keys is 0{^/rn/e). □ 

Consider typical values; m = 10"^, e — 10"'' and 4-byte keys. 
Basic sampling emits 1 /e^ ~ 400MB; improved sampling emits at 
most m/e ~ 40MB; while two-level sampling emits about ^/rn/e 
« 1.2MB of data — a 330-fold or 33-fold reduction, respectively! 

Remark: In our second-level sampling, the sampling probabil- 
ity depends on the frequency, so that "important" items are more 
likely to be sampled. This falls into the general umbrella of "im- 
portance sampling" [35], and has been used for frequency estima- 
tion on distributed data [23,39]. However, its application to wavelet 
histograms and the corresponding variance analysis are new. 

Multi-dimensional wavelets. Our algorithm extends to building 
multi-dimensional wavelet histograms naturally. In d dimensions, 
frequency vector v is a d-dimensional array, and frequency array 
s of a random sample of the dataset still approximates v. So the 
problem boils down to how well s approximates v (note our two- 
level sampling algorithm does not affect the approximation error 
of the sample). However, because data is usually sparse in higher 
dimensions, the quality of the sample may not be as good as in 
one dimension. In fact, the standard deviation of the estimated fre- 
quency for any v(a::) {x is now a cell in \u\'^) from a sample of 
size 0(l/e^) is still 0(en), but due to the sparsity of the data, 
all the v(a::)'s may be small, so the relative eiTor becomes larger. 
This is, unfortunately, an inherent problem with sparse data: if all 
v(a;)'s are small, say or 1, then random sampling, and in gen- 
eral any sublinear method, cannot possibly achieve small relative 
errors [14]. One remedy is to lower the granularity of the data, i.e., 
project the data to a smaller grid [u/t]'' for some appropriate t so 
as to increase the density of the data. 

System issues. Among the three general approximation strategies 
mentioned at the beginning of Section 4, implementing the approx- 
imate TPUT methods (such as KLEE [28]) in Hadoop requires at 
least three rounds of MapReduce, which involves too much over- 
head for just approximating a wavelet histogram. Wavelet sketches 
can be easily implemented in Hadoop. The idea is to run one Map- 
per per split, which builds a local wavelet sketch for the split and 
emits the non-zero entries in the sketch to the Reducer. The Re- 
ducer then combines these m sketches and estimates the top-fc coef- 
ficients from the combined sketch. There are two wavelet sketches 
in the literature: the AMS sketch [4,20] and the GCS sketch [13]. 
The latter was shown to have better performance, so we choose it 
to implement in Hadoop. There are some technical details in opti- 
mizing its implementation in Hadoop, which we omit here. 

The third strategy, random sampling, clearly has better perfor- 
mance as it avoids scanning the entire dataset and is also easy to 
implement in Hadoop. Our two-level sampling algorithm in addi- 
tion achieves very low communication cost. We detail how we ad- 
dress some system issues, overcome the challenges, and implement 
two-level sampling in Hadoop in Appendix B; implementation of 
the other sampling algorithms is even simpler. 

5. EXPERIMENTS 

We implement all algorithms in Hadoop and empirically evaluate 
their performance, in both end-to-end running time and communi- 
cation cost. For the exact methods, we denote the baseline solution 



of sending all local frequency vectors (the Vj's of all splits) in Sec- 
tion 3 as Send-V, the baseline solution of sending the local wavelet 
coefficients (the Wij's of all splits) in Section 3 as Send-Coef, and 
our new algorithm as H-WTopk (meaning "Hadoop wavelet top- 
k"). For the approximate algorithms, we denote the basic sampling 
method as Basic-S, the improved sampling method as Improved-S, 
and the two-level sampling method as TwoLevel-S. Note Improved- 
S is based on the same idea as Basic-S, but offers strictly better 
performance, which we derived in Section 4. Given this fact, we 
choose to utilize Improved-S as the default competitor of TwoLevel- 
S. We also implement the sketch-based approximation method as 
discussed in Section 4. We use the GCS-sketch which is the state- 
of-the-art sketching technique for wavelet approximations [13]. We 
denote this method as Send-Sketch. We did not attempt to modify 
the approximate TPUT methods (such as KLEE [28]) to work with 
negative values and adapt them to MapReduce, since they generally 
require multiple rounds and scanning the entire datasets, which will 
be strictly worse than other approximation methods. 

Setup and datasets. All experiments are performed on a heteroge- 
neous Hadoop cluster running the latest stable version of Hadoop, 
version 0.20.2. The cluster consists of 16 machines with four dif- 
ferent configurations: (1)9 machines with 2GB of RAM and one 
Intel Xeon 5120 1.86GHz CPU, (2) 4 machines with 4GB of RAM 
and one Intel Xeon E5405 2GHz CPU, (3) 2 machines with 6GB 
of RAM and one Intel Xeon E5506 2.13GHz CPU, and (4) 1 ma- 
chine with 2GB of RAM and one Intel Core 2 6300 1.86GHz CPU. 
Our master runs on a machine with configuration (2) and we select 
one of the machines of configuration (3) to run the (only) Reducer. 
We configure Hadoop to use 300GB of hard drive space on each 
slave and allocate 1GB memory per Hadoop daemon. We have one 
TaskTracker and one DataNode daemon running on each slave, and 
a single NameNode and JobTracker daemon on the master. All ma- 
chines are directly connected to a lOOMbps switch. 

For the datasets, clearly, the determining parameters are n, the 
total number of records, which corresponds to the size of the in- 
put file, and u, the domain size, as well as the skewness. Note it 
only makes sense to use a dataset which is at least tens of gigabytes 
and has a domain size on the order of 2^". Otherwise a central- 
ized approach would work just fine, and the overhead of running 
MapReduce could actually lead to worse performance [15]. 

That said, for real datasets, we test all algorithms on the World- 
Cup [6] dataset which is the access logs of 92 days from the 1998 
World Cup servers, a total of approximately 1.35 billion records. 
Each record consists of 10 4-byte integer values including month, 
day, and time of access as well as the client id, object id, size, 
method, status, and accessed server. We assign to each record a 
4-byte identifier clientobject, which uniquely identifies a distinct 
client id and object id pairing. The object id uniquely identifies a 
URL referencing an object stored on the World Cup servers, such 
as a page or image. The pairing of the client id and the object id 
is useful to analyze the correlation between clients and resources 
from the World Cup servers, under the same motivation as that in 
the more common example of using the (src ip, dest ip) pairing in a 
network traffic analysis scenario. There are approximately 400 mil- 
lion distinct client id object id combinations, so the domain of this 
key value is approximately 2^", i.e. u = 2^". We store WorldCup 
in binary format, and in total the stored dataset is 50GB. 

To model the behavior of a broad range of real large datasets, 
we also generate datasets following the Zipfian distribution (since 
most real datasets, e.g., the clientobject in WorldCup, are skewed 
with different levels of skewness), with various degrees of skew- 
ness Q, as well as different u and n. We randomly permute keys 
in a dataset to ensure the same keys do not appear contiguously in 
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the input file. Each dataset is stored in binary format and contain 
records with only a 4-byte integer key. Unless otherwise specified, 
we use the Zipfian dataset as our default dataset to vigorously test 
all approaches on a variety of parameters on large scale data. 

We vary a in {0.8, 1.1, 1.4} and logj u in { 8, 11, 14, 17, 20, 23, 
26, 29, 32 }. We vary input file size from 10GB to 200GB resulting 
in different n from 2.7 to 54 billion. We vary the size of a record 
from 4-bytes to lOOkB. For all algorithms, we use 4-byte integers 
to represent v(a::) in a Mapper and 8-byte integers in a Reducer. We 
represent wavelet coefficients and sketch entries as 8-byte doubles. 

For all experiments, we vary one parameter while keeping the 
others fixed at their default values. Our default q is 1.1 and logj u 
is 29. The default dataset size is 50GB (so the default n is 13.4 
billion). The default record size is 4-bytes. We compute the best 
fc-term wavelet histogram with = 30 by default, which also 
varies from 10 to 50. The default split size j3 is 256MB, which 
varies from 64MB to 512MB. Note that the number of splits is 
m = 4n/(1024^^) (so the default m is 200). We also simulate a 
live MapReduce cluster running in a large data center where typi- 
cally multiple MapReduce jobs are running at the same time, which 
share the network bandwidth. Thus, the default available network 
bandwidth is set to 50% (i.e., 50Mbps) but we also vary it from 
10% to 100%. Note, we omit the results for Send-Coef on all ex- 
periments except for varying the domain it of the input dataset as it 
performs strictly worse than Send-V for other experiments. 

The exact methods have no parameters to tune. For Send-Sketch, 
we use a recommended setting for the GCS-sketch from [13], where 
each sketch is allocated 20KB- logj u space. We use GCS-8 which 
has the overall best per-item update cost and a reasonable query 
time to obtain the final coefficients. We also did the following op- 
timizations: First, for each split, we compute the local frequency 
vector Vj, and then insert the keys into the sketch so we update the 
sketch only once for each distinct key. Second, we only send non- 
zero entries in a local sketch to the Reducer For the two sampling 
methods, the default e is 10^*, and we vary it from 10~^ to 10~^. 
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Figure 5: Cost analysis: vary k. 

Results on varying k. We first study the effect of k, i.e., the size 
of the wavelet histogram to be computed. Figure 5 shows the effect 
of varying k on the communication cost and running time of all al- 
gorithms. The results show k has little impact on performance, ex- 
cept for the communication cost of H-WTopk. This is expected, as 
Send-V (resp. Send-Sketch) always compute and send out all local 
frequency vectors (resp. their sketches). The sampling methods are 
also unaffected by k as the sampling rate is solely determined by m 
and e. However, H-WTopk's communication cost is closely related 
to k, as it determines thresholds Ti and T2 for pruning items. 

For the exact methods, H-WTopk outperforms Send-V by orders 
of magnitude, in both communication and running time. It also out- 
performs Send-Sketch, which is an approximate method. The two 
sampling algorithms are clearly the overall winners. Nevertheless, 
among the two, in addition to a shorter running time, TwoLevel- 
S reduces communication to 10%-20% compared to Improved- 
S. Recall our analysis indicates an O(v'm) -factor reduction from 
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Figure 8: Cost analysis: vary e. 

Improved-S to TwoLevel-S; but this assumes any input data. Due 
to the skewness of the Zipfian data distribution, Improved-S actu- 
ally combines many keys into one key-value pair, and thus typically 
does not reach its 0{m/e) upper bound on communication. Over- 
all, the sampling algorithms have impressive performance: On this 
50GB dataset, TwoLeveJ-S incurs only 1MB communication and 
finishes in less than 3 minutes. In contrast, Send-Sketch takes about 
10 hours (most time is spent updating local sketches), Send-V about 
2 hours (mostly busy communicating data), and H-WTopk 33 min- 
utes (scanning inputs plus overhead for 3 rounds of MapReduce). 

We must ensure the efficiency gain of the sampling methods does 
not come with a major loss of quality. Thus, we examine the sum 
of squared error (SSE) between the frequency vector reconstructed 
from the wavelet histogram and that of the original dataset. The 
results are shown in Figure 6. Since Send-V waA H-WTopk are ex- 
act methods, they represent the best possible reconstruction using 
any fc-term wavelet representation. So their curves are identical in 
Figure 6 and represent the ideal error for measuring the accuracy 
of the approximation methods. Clearly, when k increases, the SSEs 
of all methods decrease. Among the three approximation methods, 
TwoLevel-S returns wavelet histograms which come very close to 
the ideal SSE. Improved-S has the worst SSE as it is not an unbi- 
ased estimator for v, and the gap from the ideal SSE widens as k 
gets larger, as it is not good at capturing the details of the frequency 
vector Send-Sketch' i SSE is between TwoLevel-S and Improved-S. 
Even though the SSE looks large in terms of absolute values, it is 
actually quite small considering the gigantic dataset size. When 
k > 30, the SSE is less than 1% of the original dataset's energy. 
Varying e. Next we explore the impact of e on all sampling meth- 
ods, by varying it from 10~^ to 10~^ in Figure 7. In all cases, 
TwoLevel-S consistently achieves significantly better accuracy than 
Improved-S, as the first is an unbiased estimator of v while the lat- 
ter is not. Both methods have larger SSEs when e increases, with 
e — 10~* achieving a reasonable balance between the SSE and ef- 
ficiency (to be shown next), hence it is chosen as the default. Figure 
8 shows all sampling methods have higher costs when e decreases 
(from right to left). In all cases, TwoLevel-S has significantly lower 
communication cost than Improved-S as seen in Figure 8(a). It 
also has a lower running time compared to Improved-S as shown 
in Figure 8(b). In a busy data center where network bandwidth is 
shared by many concurrent jobs, the savings in communication by 
TwoLevel-S will prove to be critical and the gap for the running 
time will widen even more. 
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In what follows, we omit the results on SSEs when we vary the 
other parameters, as they have less impact on the SSEs of various 
methods, and the relative trends on SSEs for all methods are always 
similar to those reported in Figures 6 and 7. 
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Figure 9: Communication and running time versus SSE. 

Comparing SSE. For the next experiment we analyze the com- 
munication and computation overheads of all approximation algo- 
rithms to achieve a similar SSE in Figure 9, where the defaults 
of all algorithms are circled. In Figure 9(a) we see that the com- 
munication cost increases as the SSE decreases for all algorithms. 
TwoLevel-S achieves the best SSE to communication cost, and com- 
municates at least an order of magnitude less than Improved-S and 
two orders of magnitude less than Send-Sketch to achieve a simi- 
lar SSE. Among the algorithms, TwoLevel-S is the most efficient 
in terms of running time, achieving a similar SSE to Send-Sketch 
in orders of magnitude less time and approximately 2-3 times less 
time than Improved-S, as shown in Figure 9(b). These results also 
indicate the sketch size selected at 20kB * log2(w) is most compet- 
itive against the sampling based algorithms, justifying our choice 
for using it as the default value for the GCS-sketch. 
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Figure 10: Cost analysis: vary n. 

Varying dataset size n. Next, we analyze the scalability of all 
methods by varying n, or equivalently the dataset size. Note as n 
increases, so does m, the number of splits. This explains the gen- 
eral trends in Figure 10 for both communication and running times. 
There are two points worth pointing out. First, TwoLevel-S outper- 
forms Improved-S by a larger margin in terms of communication 
cost for larger datasets due to the O(y'm) -factor difference, which 
becomes more than one order of magnitude when the data becomes 
200GB. Second, the increase in m leads to longer running times 
of all methods, but the two sampling algorithms are much less af- 
fected. The reason is the sampling algorithms mainly have two 
kinds of running time costs: overheads associated with process- 
ing each split (i.e.. Mapper initialization), which linearly depends 
on m, and sampling overheads where the sample size is always 
0(l/e'^), which is independent of n. The net effect of these costs 
is a slow growth in running time. Overall, H-WTopk and TwoLevel- 
S are clearly the best exact and approximate methods, respectively. 

Varying record size. In Figure 1 1 we analyze the effect varying 
the record size has on the performance of all algorithms. We fix the 
number of records as 4,194,304 (which is the number of records 
when the total dataset reaches 400GB with lOOkB per record), for 
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Figure 11: Cost analysis: vary record size. 

the default Zipfian dataset, and vary the size of a record from 4 
bytes (key only) to lOOkB, which corresponds to a dataset size of 
16MB to 400GB consisting of 1 and 1600 splits respectively. We 
see the communication cost increases for all methods as the record 
size increases. This makes sense since increasing the number of 
splits has a negative impact to all of their communication costs. 
Nevertheless, even with 1600 splits when the record size is lOOkB 
H-WTopk still communicates less than Send-V\ and TwoLevel-S still 
outperforms the other algorithms by orders of magnitude with re- 
spect to communication. 

The running time of all algorithms also increases as the record 
size increases, while the total number of records is fixed. This is 
not surprising due to several factors when the record size increases: 
1) all algorithms communicate more data; 2) there are much more 
splits than the number of slaves in our cluster; 3) the 10 cost be- 
comes higher. Note that regardless of the record size H-WTopk 
still performs better than Send-V. We also note that the clear win- 
ner is TwoLevel-S with a running time roughly an order of mag- 
nitude better than Send-V. Finally, the performance gap between 
H-WTopk and Send-V, as well as the gap between TwoLevel-S and 
Improved-S, are not as significant as in other experiments. This is 
mostly due to the small number of records (only 4 million, in con- 
trast to 13.4 billion in the default zipfian dataset and 1.35 billion 
in the WorldCup dataset) we have to use in this study, which is 
constrained by the number of records we can accommodate for the 
maximum record size (lOOkB), while still keeping the total file size 
under control (400GB when each record becomes lOOkB). 
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Figure 12: Cost analysis: vary u. 

Varying domain size u. We next examine how u affects all algo- 
rithms. Note as we increase u while keeping n fixed, the tail of the 
data distribution gets longer while frequent elements get slightly 
less frequent. Figure 12(a) shows this affects Send-V, which is ob- 
vious as each local frequency vector Vj gets more entries. We note 
that 5e/trf-V performs better than Send-Coef for all tested domains. 
Send-Coef reduces the computational burden at the reducer by per- 
forming the wavelet transform in parallel over the local frequency 
vectors. However, the results indicate that the potential savings 
from computing the wavelet transform in parallel is canceled out 
by the increase in communication and computation cost of Send- 
Coef o\er Send-V. The overheads in Send-Coef are caused by the 
fact that the number of local wavelet coefficients grows linearly to 
the domain size, regardless of the size of each split and how many 
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records a local split contains. Thus, with the increasing domain 
size, the communication cost and the overall running time of this 
approach quickly degrade. Indeed, the total non-zero local wavelet 
coefficients are almost always much greater than the total number 
of keys in the local frequency vector with a non-zero frequency. 
Since Send-V always results in less communication and computa- 
tion overheads than Send-Coef, we use Send-V d& our default base- 
line algorithm for all other experiments. In terms of running time, 
larger u makes all methods slower except the sampling-based al- 
gorithms. Send-V, Send-Coef, H-WTopk and Send-Sketch all more 
or less linearly depend on u: Send-V and Send-Coef we. obvious; 
H-WTopk needs 0{u) time to compute the wavelet transformation 
for each v^; while Send-Sketch needs to make 0{u) updates to the 
sketch. The two sampling algorithms are not affected as their sam- 
ple size is independent of u. 
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Figure 13: Cost analysis: vary split size /?. 

Varying split size /?. Figure 13 shows the effect of varying the split 
size l3 from 64MB to 512MB while keeping n fixed. The number of 
splits m drops for larger split sizes (varying from 800 to 100 for the 
50GB dataset). Hence, the communication cost of all algorithms 
drop with a larger split size. This is essentially the opposite of 
Figure 10(a) where we increase n (hence m) for a fixed split size. 
The difference is, for Send-V, the communication is not reduced as 
much, since as the split gets larger, there are more distinct keys in 
each split, which cancels some benefit of a smaller m. 

The running times of all methods reduce slightly as well for 
larger split size, because Send-V has less communication overhead, 
H-WTopk has to perform less local wavelet transformations, and 
Send-Sketch has less updates to the local sketches. For the two sam- 
pling algorithms, although their sample size does not depend on m, 
the communication (hence the cost of the Reducer who needs to 
process all the incoming messages) reduces as m gets smaller. 

All these seem to suggest we should use a split size as large as 
possible. However, there is a limit on the split size, constrained 
by the available local disk space (so that a split does not span over 
multiple machines, which would incur significant communication 
cost when processing such a split). In addition, larger split sizes 
reduce the granularity of scheduling and increase the overhead of 
failure recovery. On our cluster with 16 machines, these issues do 
not manifest. But on large clusters with thousands of machines, the 
split size should not be set too large. So the typical split size as 
recommended by most works in the literature (e.g. [2, 25, 38]) is 
either 128MB or 256MB. 

Varying data skewness a. We also study the effect of data skew- 
ness a, with a as 0.8, 1.1, 1.4 and show results in Figure 14 and 
15. When data is less skewed, each split has more distinct key 
values. As a result, the communication cost of Send-V is higher, 
leading to higher running time. The running time of Send-Sketch 
becomes more expensive as more local sketch updates are neces- 
sary. The communication and running time of other methods have 
little changes. The SSE is analyzed in Figure 15. All methods' SSE 
seem to improve on less skewed data. Nevertheless, TwoLevel-S 
consistently performs the best among all approximation methods. 
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Figure 14: Cost analysis: vary skewness a. 
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Figure 15: Vary a SSE. 
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Varying bandwidth B. Finally, Figure 16 shows the effect the 
bandwidth B has on the running time of all methods, by vary- 
ing it from 10% to 100% of the full network bandwidth which 
is 100Mbps. The communication cost of all algorithms are unaf- 
fected by B. Send-V enjoys an almost linear reduction in running 
time when B increases as transmitting data dominates its running 
time. Other methods see a slight reduction in their respective run- 
ning times. 
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Figure 17: Cost analysis: WorldCup dataset. 

WorldCup Dataset. Figure 17 analyzes the performance of all al- 
gorithms on WorldCup using default k, e, /3, and B values, in which 
we attempt to compute the best fc-term wavelet representation over 
the clientobject attribute. Notice in Figure 17(a) the communica- 
tion trends for all algorithms are similar to our previous observa- 
tions. We note the WorldCup dataset is approximately 50GB with 
almost 2^^ distinct clientobject values, which are the defaults used 
for the Zipfian datasets. Send-V s communication cost is dependent 
on two primary factors: The skewness of the data and the total num- 
ber of distinct values. As the data becomes more skewed, Send-V 
can leverage on the Combine function to reduce communication. 
However, as we see in Figure 17(a) Send-V requires roughly the 
same amount of communication as for the Zipfian datasets. This 
indicates by varying a, u and n for the Zipfian datasets we can 
approximate the distribution of real large datasets fairly well. 

In Figure 17(b) we observe the running times of all approaches 
on WorldCup. Send-Vs running time is mainly dependent on its 
communication cost. The data communicated is about the same 
as the default Zipfian dataset so it is not surprising Send-V pre- 
forms similarly on the WorldCup dataset. We would like to note 
TwoLevel-S saves almost 2 orders of magnitude and H-WTopk saves 
about 0.5-1 order of magnitude over Send-V indicating our algo- 
rithms are effective on large real datasets. 
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Figure 18: SSE on WorldCup. 
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Figure 19: Comm. & running time vs SSE on WorldCup. 

We observe the SSE on WorldCup in Figure 18. The relative 
performance of various algorithms are similar to the previously ob- 
served trends for Zipfian datasets in Figure 15. We also analyze 
the communication and running time of all algorithms versus the 
SSE on WorldCup in Figure 19. The trends are again similar to that 
in Figure 9 for Zipfian datasets. Notice the Send-Sketch method 
achieves a similar SSE, with at least an order of magnitude more 
communication and orders of magnitude more computation over- 
heads than other methods. We observe that TwoLevel-S achieves the 
best overall SSE to communication cost, requiring approximately 
an order of magnitude less communication than other methods. In 
addition, TwoLevel-S is also 2-3 times or orders of magnitude faster 
than other methods to achieve a similar SSE. 

Experimental conclusion. These extensive results reach the clear 
conclusion H-WTopk is the choice if we wish to find exact top- 
k wavelet coefficients, outperforming the baseline exact method 
Send-Vby several orders of magnitude in communication, and 0.5- 
1 order of magnitude in running time; when approximation is al- 
lowed, TwoLevel-S is the best method. Not only does it offer the 
cleanest solution, but it also achieves an SSE nearly as good as 
exact methods for a tiny fraction of their communication cost and 
running time. In addition, it achieves the best overall communica- 
tion and running time to achieve an SSE similar to other sampling 
and sketching techniques. It produces an approximate wavelet his- 
togram of high approximation quality for 200GB data of domain 
size of 2^^ in less than 10 minutes with only 2MB communication! 

6. RELATED WORK 

The wavelet histogram and wavelet analysis, introduced to data 
management for selectivity estimation by Matias et al. [26], has 
quickly emerged as a widely used tool in databases, data mining, 
and data analysis [3, 9, 21, 33]. Matias et al. have also studied 
how to dynamically maintain the wavelet histograms under updates 
[27]. Gilbert et al. [20] extended the construction of the wavelet 
histogram to streaming data, using the AMS sketch [4]. Cormode et 
al. [13] then improved the efficiency of the sketch with the Group- 
Count Sketch (GCS). 

Many types of histograms exist. Poosala et al. [32] presented 
an excellent discussion on the properties of various histograms. 
How to efficiently build other types of histograms for large data in 
MapReduce is an intriguing open problem we plan to investigate. 

Since its introduction [15], MapReduce has quickly become a 
primary framework for processing massive data. It represents the 



trend of going towards parallel and distributed processing on shared- 
nothing commodity clusters [8, 10,31]. Significant effort has been 
devoted to improving the efficiency, the functionality and query 
processing in MapReduce, e.g., Amazon EC2 [5], HadoopDB [1], 
Hadoop-H- [16], Hadoop [22], MapReduce Online [11], and many 
others [12]. To the best of our knowledge, efficient construction of 
wavelet histograms in MapReduce has not been studied. 

Our work is also related to finding distributed top- A; and frequent 
items. The best exact method for distributed top-fc is TPUT [7]. 
However, it (and other methods, e.g., [28]) does not support find- 
ing the aggregates with the largest absolute values over positive 
and negative value sets. Our exact algorithm shares the basic prin- 
ciple in distributed query processing, however, comes with novel 
designs in order to work for wavelets in MapReduce. The approx- 
imate distributed top-fc query has been studied in [28] and many 
others. However, they also only support non-negative scores and 
require multiple rounds, which introduce considerable overhead in 
the MapReduce framework. As such, we did not attempt to adapt 
them as approximation methods. Instead, for approximation meth- 
ods, we focus on algorithms that require only one round of MapRe- 
duce. The most related works are methods for finding heavy hitters 
from distributed datasets [39]. But they are not tailored for the 
MapReduce environment, and use complicated heuristics that are 
hard to implement efficiently in Hadoop. There is also a study on 
finding the top-fc largest valued items in MapReduce [34], where 
each item has a single total score, which is clearly different from 
(and does not help) our case. 

7. CLOSING REMARKS 

Massive data is increasingly being stored and processed in Map- 
Reduce clusters, and this paper studies how to summarize this mas- 
sive data using wavelet histograms. We designed both exact and 
approximation methods in MapReduce, which significantly outper- 
form the straightforward adaptations of existing methods to MapRe- 
duce. Our methods are also easy to implement, in particular the 
two-level sampling method, making them appealing in practice. 

Data summarization is an important technique for analyzing large 
datasets. The wavelet histogram is merely one representative, and 
there are many other types of summaries we may consider in the 
MapReduce model, such as other kinds of histograms (e.g., the 
V-optimal histogram [24]), various sketches and synopses, geo- 
metric summaries like e-approximations and coresets, and graph 
summaries like distance oracles. Another open problem is how to 
incrementally maintain the summary when the data stored in the 
MapReduce cluster is being updated. Finally, data summarization 
in MapReduce is also an intellectually challenging problem, requir- 
ing a good blend of algorithmic techniques and system building. 
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APPENDIX 

A. SYSTEM ISSUES OF H-WTOPK 

IVIapReduce Round 1: In Round 1, a Mapper first computes local 
frequency vector Vj for split j by using a hashmap to aggregate the 
total count for each key encountered as the records in the split are 
scanned. After Vj is constructed, we compute its wavelet coeffi- 
cients in the Close interface of the IVIapper. Since the number of 
nonzero entries in Vj, denoted as jvj|, is typically much smaller 
than u, instead of running the 0(it)-time algorithm of [26], we use 
the 0(|vj| logti)-time and 0(log it)-memory algorithm of [20]. 
During the computation we also keep two priority queues to store 
the top-fc and bottom-fc wavelet coefficients. 

After all coefficients for split j have been computed, the Mapper 
emits an intermediate key- value pair (i, {j,'Wi,j)) for each of the 
top-fc and bottom-fc wavelet coefficients Wij of the split. In the 
emitted pairs, the Mapper marks the fc-th highest and the k-th low- 
est coefficients using (i, (j + m,Wi,j)) and {i, (j + 2m, re- 
spectively. Finally, the Mapper saves all the unemitted coefficients 
as state information to an HDFS file associated with split j. 

After the Map phase, the Reducer receives the top-A; and bottom- 
k wavelet coefficients from all the splits, 2km of them in total. 
We denote by R the set of distinct indices of the received coeffi- 
cients. For each index i £ R, The Reducer passes the correspond- 
ing (j, Wi,j)'?, received to a Reduce function, which adds up these 
Wij's, forming a partial sum Wi for Wi. Meanwhile we construct 
a bit vector Ft of size m such that Fi{j) = if Wi^j has been re- 
ceived and Fi{j) = 1 if not. While examining the (j, Wi^jYs in the 
Reduce function, if we encounter a marked pair, we remember it so 
the fc-th highest and the fc-th lowest coefficient from each split j, 
denoted as w+ and wj , can be obtained. 

After we have all partial sums Wi for all i £ R, and , w~ 
for all j, we compute upper bound (resp. lower bound t^) on 
Wi, by adding SJLi Fi{j)wf (resp. J2]Li Fi{j)w~) to Wi. Then 
we obtain a lower bound Ti on \ wi\, hence Ti, as described in Sec- 
tion 3. Finally, we save tuple (i, Wi,Fi) for all i £ R, and Ti as 
state information in a local file on the Reducer machine. 

MapReduce Round 2: To start Round 2, Ti /m is first set as a 
variable in the Job Configuration. In this round, for the Map phase 
we define an alternate InputFormat so a Mapper does not read an 
input split at all. Instead, a Mapper simply reads state informa- 
tion, i.e., all wavelet coefficients not sent in Round 1, one by one. 
For any Wi,j such that > Ti/m, a Mapper emits the pair 
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The Reducer first reads tuple {i,Wi, Fi) for all i G -R from the 
local file written in Round 1. For each i, it passes all corresponding 
{j,Wi.j)'s received in this round to a Reduce function. Now we 
update partial sum Wi by adding these new coefficients, and update 
Fi correspondingly. We also refine upper bound (resp. lower 
bound rf) as r+ — Wi + \\Fi\\i ■ Ti/m (resp. r^" — wt — \\Fi\\i ■ 
Ti/m), where H-FiHi denotes the number of I's in Fi. 

With the updated , r ~ , we obtain a new T2 , which can be 
used to prune indices from R as described in Section 3. Lastly, the 
Reducer writes updated Wi for all i G _R in a local file, and the set 
of candidate indices R in an HDFS file. 

MapReduce Round 3: In Round 3, the master reads R from HDFS 
and adds it to the Distributed Cache. Like in Round 2, the Mappers 
still do not read the input splits. During initialization, each Mapper 
reads R from the distributed cache. Then, it reads from the state 
file storing the wavelet coefficients. For any Wij it checks if i G -R 
and \ wij\ < Ti/m. If so, it means it has not been communicated 
to the Reducer yet, and thus we emit (i, {j, Wij)). 

On the Reducer side, similar to Round 2, the Reducer first reads 
R and Wi for all i G R's from the local file. Then for each i, 
the Reduce function adds all newly received Wij's to Wi, yielding 
accurate Wi. Finally, we return the top-fc coefficients Wi of largest 
magnitude for i G i? as the best fc-term representation for v. 

B. SYSTEM ISSUES OF TWOLEVEL-S 

As before, we will have m Mappers, one per input split, and 1 
Reducer. The first issue is how to randomly read records from an 
input split. The default Hadoop RecordReader in InputFile format 
is designed to sequentially scan an input split. Hence, we define 
our own InputFile format RandomlnputFile, assuming each record 
in the input files has a fixed size. The RandomlnputFile defines 
a custom RecordReader, called RandomRecordReader, which can 
randomly sample records from an input split. A straightforward im- 
plementation is to simply seek to a random offset in the split when 
the Mapper requests the next record, but this requires seeking offset 
locations in both directions. Instead, we implement it as follows. 

When the RandomRecordReader is first initialized, it determines 
Uj , the number of records in the split. Next, it randomly selects prij 
offsets in the split, where p = l/(e^n) is the sample probability 
of the first-level sampling, and stores them in a priority queue Q 
sorted in ascending order. Afterwords, every time the Random- 
RecordReader is invoked by the Mapper to retrieve the next record 
from the split, it seeks to the record indicated by the next offset, and 
retrieves the record there. We continue this process iteratively until 
all puj random records have been obtained. Note in Section 4, we 
assume coin-flip sampling for sake of simpler analysis; here we use 
sampling without replacement. It has been observed coin-flip sam- 
pling and sampling without replacement behave almost the same 
for most sampling-based methods [37], and we observe this is also 
true for our sampling-based approaches. 

Using RandomlnputFile as the InputFile format, two-level sam- 
pling can be implemented in one round of MapReduce, as follows. 

Map phase. During initialization of the Map phase we specify n 
and e in the Job Configuration. With the RandomRecordReader, the 
MapRunner reads the pUj random records one at a time and invokes 
the Map function for each record, which simply maintains aggre- 
gated counts for keys of the sampled records. After the MapRun- 
ner has processed all sampled records, the Mapper's Close routine 
is called. It iterates over all sampled keys and checks their aggre- 



gate counts. If Sj{x) > l/{e^/rn), we emit the pair {x,Sj{x)). 
Otherwise, we emit {x, 0) with probability ey/m ■ Sj (x). 

Reduce phase. For each key x, the Reducer passes all correspond- 
ing {x,Sj{x)) or (a;,0) pairs to the Reduce function, which com- 
putes the estimated v{x) as described in Section 4. After all keys 
are processed by the Reducer, its Close method is invoked, where 
approximate wavelet coefficients are computed from approximate 
global frequency vector v. In the end, we emit {i, Wi) pairs for the 
top-A; approximate coefficients (with the k largest magnitudes). 

Remarks. In our discussion so far, our RandomRecordReader as- 
sumes fixed length records. However, it is easy to extend it to sup- 
port variable length records as well. Instead, assume records of 
variable length end with a 4-byte record length followed by a de- 
limiter character or byte sequence (e.g., a new line character). The 
RandomRecordReader initially generates prij random offsets and 
inserts them in a priority queue Q. It then processes offsets from 
Q one at a time. It seeks to an offset and scans forward until it 
finds the record length r and delimiter (this is easy to achieve with 
a few-bytes look-ahead buffer). The RandomRecordReader deter- 
mines the start offset o of the record, using r and the end offset of 
the record, and records an (o, r) pair in a Heap H sorted by o. It 
is possible some of the puj randomly selected offsets may point to 
the contents of the same record. Note RandomRecordReader pro- 
cesses sampled offset locations in (ascending) sorted order. Hence, 
whenever the RandomRecordReader processes an offset o, it deter- 
mines if its current split position is larger than o. If it is, o points 
to the contents of the same record as the last processed offset. In 
this case, RandomRecordReader randomly generates a new offset 
o' (to replace o) which does not point to locations covered by any 
(o, r) pairs in H; this is easy to ensure using the information in H. 
If o' appears before the cuiTent split location of the RandomRecor- 
dReader, the end split offset is added to o' . Then, the Random- 
RecordReader pushes o' onto the priority queue Q (which contains 
yet to be processed offsets) and continues with the next offset. If the 
end of the split is reached, and Q is not empty, the RandomRecor- 
dReader goes back to the head of the split to continue processing 
offsets in Q; first subtracting the end split offset from offsets in Q. 
The process is repeated until enough records are sampled (Q be- 
comes empty). At this point H contains sorted offsets of all sam- 
pled prij records (and their lengths). The RandomRecordReader 
seeks to and reads all records at offsets in H by ascending order. 

Another issue which must be addressed is how the Random- 
RecordReader should determine the number of records in the jth 
split Uj when records are variable length, so that it may determine 
the number of records which should be sampled. We can either 
assume that this information is made available when the datasets 
were initially loaded into the MapReduce cluster, or statistics such 
as the average, median, or minimum record sizes can be used to 
determine an appropriate rij depending on the application. Addi- 
tionally, in order to decrease the likeliness that any two of the prij 
randomly selected offsets point to the contents of the same record, 
it can be enforced that the offsets do not appear within the average, 
median, or minimum record size of each other. 

In the implementation of our exact and sampling methods, we 
choose to do the final processing in the close method, instead of 
using the combiner. This is a technicality due to the default be- 
havior of Hadoop, which runs the COMBINE function continuously 
while keys are being processed to save memory buffer space (leads 
to fewer disk writes). On the other hand, the close method is guar- 
anteed to run only once when all keys have been processed. 
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